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ABSTRACT 

We present a data mining framework for the analysis and discovery of anomalies in high-dimensional time 
series of sensor measurements that would be found in an Integrated System Health Monitoring system. We 
specifically treat the problem of discovering anomalous features in the time series that may be indicative of 
a system anomaly, or in the case of a manned system, an anomaly due to the human. Identification of these 
anomalies is crucial to building stable, reusable, and cost-efficient systems. The framework consists of an analy- 
sis platform and new algorithms that can scale to thousands of sensor streams to discovers temporal anomalies. 
We discuss the mathematical framework that underlies the system and also describe in detail how this frame- 
work is general enough to encompass both discrete and continuous sensor measurements. We also describe 
a new set of data mining algorithms based on kernel methods and hidden Markov models that allow for the 
rapid assimilation, analysis, and discovery of system anomalies. We then describe the performance of the sys- 
tem on a real-world problem in the aircraft domain where we analyze the cockpit data from aircraft as well as 
data from the aircraft propulsion, control, and guidance systems. These data are discrete and continuous sensor 
measurements and are dealt with seamlessly in order to discover anomalous flights. We conclude with recom- 
mendations that describe the tradeoffs in building an integrated scalable platform for robust anomaly detection 
in ISHM applications. 


INTRODUCTION 

Modem ISHM systems contain hundreds or thousands of sensors producing discrete and continuous mea- 
surements. The union of all the sensor signals at a given time t can be considered the observed state of the 
system. The entire record of the sensor measurements represents the evolution of the system through time. In 
this paper we focus on the situation where the sensor measurements are producing discrete signals. Discovering 
anomalies in continuous systems has been extensively treated in the data mining and statistics literature [5, 6]. 
We assume that the observed system evolution can be functionally described by the following equations: 

h t = HK-i) (1) 

x ( = (2) 

We assume that the function <£ determining the evolution of the system state h t is unknown. We also discuss 
two situations, one where the state h t is observable, and one where it is assumed to be hidden. The vector x 
is an N dimensional observed binary state vector, and xj'_ 1 is the entire history of the observed state vector: 
= [x 0 ,xx, ..., x t _x]. The quantity u t is the observed input to the system. Because we assume that we do 
not know the function 4>, we cannot rely on it to help us determine whether or not the observed vector x t is 
anomalous. 

The problem that we address in this paper is to develop a method to discover whether or not the current 
observed state x t is anomalous or not based on the observed history of the system, as well as replicates of the 
system behavior. Thus, let X be a T x N matrix whose tth row AT(£, :) contains the values of the N binary sensors 
at time t during a single run of the system. Likewise, the nth column X(:, n) contains the time series representing 
the time-ordered sequence of states of the nth sensor. Different replicates of the system are indexed by l and are 
represented by different X;, for l = 1 . . . L. 
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We make two assumptions regarding the sensor measurements in order to develop the ideas in this paper: 

• the times in each X/ have all been normalized so that t = 0 (i.e. the first row of X/) corresponds to the start 
of the system. 

• all replicates of the system l under consideration correspond to the the system evolving under the same 
basic operating conditions. 

In order to ground this example to a real system, we consider the situation where there are N « 1000 dif- 
ferent binary switches in an aircraft cockpit h We assume that we have a recording of length T ~ 1000 of 
the switch settings, sampled at uniform intervals. Note, however, that each flight may have a different value 
of T, since different flights have different durations, etc. As the pilot maneuvers the airplane through its vari- 
ous phases of flight, the pilot flips these switches. We assume that we have L & 10, 000 different flights recorded. 

Clearly, the vast majority of the flights in both civilian and military aircraft are such that the pilot flips the 
appropriate switches at the appropriate times in the appropriate order. Due to standard operating procedures 
(SOPs) and extensive training, one would expect the deviation between different flights to be small. 

This paper addresses the problem of identifying anomalies in the sequences of states of the switches that are 
observed. The anomaly is only defined with respect to the sample of L different flights, or replicates, since we 
do not have an explicit model of how the system evolves through time, i.e., we do not, and cannot know $ or ^ 
in Equations 1 and 2. 

Note that the data sets involved in this simple example are extremely large. For the values given above, the 
data set D — {Xi, • ♦ • , X^} is on the order of 1.25 gigabytes in size. In the event that the data set D is refreshed 
daily, we have an incredible escalation in data set size and associated storage and maintenance costs. The meth- 
ods that we develop need to be able to scale to the massive data sets without requiring supercomputing power 
and significant storage media. With these assumptions we would like to build a model P(X) from a data set 
D = {Xi, • • • , Xl}. The data X/ arise from different airlines flying the same route using the same airplane. 
Given that we are going to have a large volume of data any model for P(X) must be able to learned and probed 
quickly 2 


TWO MODELS OF THE DATA GENERATING PROCESS 

We next describe two models of the data generating process that obey Equations 1 and 2, but that have 
very different statistical characteristics, and therefore very different implications on the ISHM problem. The first 
model assumes that the data are generated by a process where each observation is probabilistically independent 
of each other observation, and that the state h t is fully observed. Since the state vector is fully observed, we can 
assume that it is part of the observation vector x* without loss of generality. The independence can be expressed 
as the following expression, for all £, t\ n, n f : 

P(x t (n),x f /(n / )) = P(x t (n))P(x t /(n')) (3) 

This data set represents a simple 'baseline 7 model which we can use to evaluate the performance of the algo- 
rithms that we describe in this paper. In order to generate data that matches this assumption, we create a T x N 
data set where (T — 1000, N — 500) using binomial random variables with parameters (n = l.p — 0.3). We 
will refer to this data set as the baseline data set. A key aspect of this data set is that we assume that the sensor 
measurements are a full representation of the state of the system. Thus, methods that are successful in finding 
faults in this data set would be those that do some sort of envelope detection either on the sensor measurements 
themselves or a transformed representation of the measurements. This data set will illustrate the properties of 

x Note that switches with m multiple settings can always be converted to m binary signals. 

2 As the data will be accumulated daily it might also be important that the model can easily be updated incrementally. 
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envelope detection algorithms that are commonplace in the ISHM literature. 

The second model of the data generating process is much more complex, but it allows for unobserved system 
anomalies and dynamically, evolving states and observable vectors. We will refer to this data set as the dynamic 
data set . The model is based on the Hidden Markov Model, which is widely used in speech recognition [7] and 
has also been discussed in the IVHM/ISHM literature [1, 3]. We next briefly describe the Hidden Markov Model 
and its applications to IVHM. 

The HMM is a dynamic model consisting of H hidden states that are assumed to be not directly observable, 
with S observation symbols associated with each state. The model has an initial probability distribution i r which 
is an 5 x 1 vector where 7 r* is the probability that the system begins in state i. The HMM starts in an initial state 
according to the distribution 7r and then the hidden state h t moves to the next state based on a Markov transition 
matrix A. The matrix A % j = P{h t +i = j\h t = i ) which gives the probability distribution of the next hidden state 
given the current state. While the system is in state i it is assumed to emit an observation vector x t according to 
the distribution Bjk = P(x t = k\h t = j). 

Notice that the HMM assumes a discrete state space for the hidden variable h t/ and that it is assumed that 
the observation vector x t is also discrete. In our case, the number of symbols S & 2 N where N — 500. Thus, we 
need to perform a preprocessing step to identify a small number of representative states. This step is as much art 
as science, and is usually performed through the use of a clustering algorithm. Banerjee et. al. [2] have given an 
excellent formulation to the problem of clustering high dimensional data based. We use their algorithm along 
with standard clustering methods such as the k-means algorithm for grouping observations into symbols. 

The implications of the HMM formulation for ISHM problems in interesting. The HMM allows for modeling 
situations where we assume that the system state is evolving according to an unobserved set of dynamics defined 
by the Markov transition matrix. As the system evolves, the state may go into one or more 'failure modes' that 
not directly observable at the sensors. However, based on the distribution of observed sensor measurements, it 
may be possible to determine a failed state. This is the avenue which we explore in this paper. In the simulations 
for this paper, we chose H — 4, S = 6, and the probability matrices A and B such that the system can fall into a 
failure mode (state 6) with probability 0.05. When the HMM construct is applied to real data, these values will 
generally be much larger. 

As noted above, the data set size is potentially very large. For the systems described in this paper, the data 
is provided as the set of matrices D = {Xi, ■ • , Xx,}. The first observation that can be made is that the matrices 
X^ are all highly sparse, meaning that most of the data is all not changing at any given time. Thus, we converted 
the data into a very compact representation as follows: 

1. Record the initial values for all binary variables x 0 . This vector represents the initial state of the system at 
time 0. 

2. Record only those variables that transition from the initial state to the next state. Since the variables are all 
binary, this can be recorded simply as the index number of the switch that changed. 

3. If needed, record the time at which the transition occurred. 

This representation of the data reduces the data set size significantly. Our experiments indicate that for real data 
sets from approximately 10,000 flights, the data set size drops to approximately 1% of the original size with- 
out any loss of information. However, this data representation imposes certain constraints on the underlying 
algorithms. We will discuss this tradeoff in the next sections. 

MODELING APPROACH 

At opposite extremes there are two potential simplifications to make regarding this problem. We could fo- 
cus either on the temporal aspects of the problem (non-stationary behavior across rows), or on the correlation 
aspects between the sensors (stationary behavior across columns). The most general kind of model, where we 
look at correlations (or higher order analogues) between X(t, n) and X(t f , n'), we discard as being to complex 
as a starting point. To obtain initial insight quickly we focus on a simple model for the time dependence and 
a simple model for the stationary behavior. However, our proposed approach can be extended to relax these 
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Figure 1: We computed the likelihood of each observation for both the benchmark data set and the dynamic 
data set as a function of time. A failure mode in the benchmark data set is apparent in the drop in the likelihood 
of the points around time step 500. These are shown in the top panel of plots for both data sets respectively. 
Notice that most of the data lies within the ±3 sigma confidence interval of the mean likelihood as indicated 
in the lower panel. The unimodal characteristic of the benchmark data set is clear in the lower left hand plot. 
The multimodal characteristic of the observation sequence is also apparent in the lower right hand plot. The 
observations are not indicative of a failure state. 

simplifying assumptions. 

In the simplest approximation we completely ignore the time dependence, and build a model for the likeli- 
hood of any particular row x of X. Different rows (i.e. different times) are assumed to be independent so that 
P(X) = n* P( x t) where t ranges over the rows of X and x t = X(t, :). Thus, we simplify things to the point of 
only having to build a model for P(x) (i.e. for a single row). 

INDEPENDENT SENSORS INDEPENDENT OF TIME 

The most naive model for an observation x assumes independence between columns (sensors), i.e. 

N 

p(x) = n p ( x ( n » w 

71=1 

where x(n) is the nth component of x. For binary data a Bernoulli assumption can be used to model the value 
of the bit x(n ) so that 

P{x{n))=p 1 - x ^(l-p n ) x ^ (5) 

where p n is the probability that the nth bit is zero. The maximum likelihood estimates for the parameters p n are 
obtained simply from frequency counts down the columns X/ (:, n) for all l in the data set D, and dividing by the 
total number of rows in D. 

We constructed this model based on the data and have plotted several key statistics in Figure ??. The plot 
indicates the negative log-likeihood of the data given the Bernoulli model 5. The negative log-likelihood is 
computed by taking the negative log of Equation 4. 

This plot is similar to those found in die change-point detection or the process control literature, in that it 
shows the likelihood of observations as a function of time. Observations that fall outside of the "expected" 
bounds can be tagged for further observation. As noted before, this model completely ignores the potential 
relationship between the binary sensors or switches, as indicated in the product above. 

DEPENDENT SENSORS INDEPENDENT OF TIME 
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Figure 2: Using the same data sets as in Figure 1, we plot the two dimensional representation of the data points 
as projected onto the first two principal components. The data points corresponding to the 'highly unlikely' 
values above fall outside the modal region in this plot. These points are indicated by diamonds. The benchmark 
data set (left) clearly shows the failure modes. However, this transformation does not show any signs of failure 
for the dynamic data set. 


Another approach that is often taken to analyze such data is based on the singular value decomposition 
(SVD) of the data set X/ [4]. This method works by factoring the matrix into three matrices as follows: 

X, = USV T (6) 

The matrix U is a T x N orthogonal matrix, S is an N x N diagonal matrix, and V is an N x T orthogonal 
matrix. SVD (which is closely related to principal components analysis) identifies the directions of maximum 
variation in the group of points defined by the rows of the data matrix. These directions can be shown to be the 
eigenvectors of the associated covariance matrix. Once the top m eigenvectors are identified (these correspond 
to those with the largest m eigenvalues), the observation vectors are left-multiplied with the eigenvectors. This 
results in an m dimensional representation of the observation vectors, where m < N. The parameter m is cho- 
sen in order to explain the maximum amount of variation in the data with the minimum number of eigenvectors. 

It has been shown that plotting the first two columns of the unitary matrix U can give diagnostic information 
about the individual data points in the original data matrix [8]. We next demonstrate the effect of analyzing 
these binary streams using SVD. Figure 2 shows the distribution of points in X for the two benchmark and the 
dynamic data sets. The outlying points in left hand side of the figure correspond to the relatively unlikely events 
that occur near time step 500 in the likelihood time series. 

These methods, while useful for analyzing high dimensional time series, do not capture the important dy- 
namic aspects of the system. Instead, they treats each row in the time series matrix as an independent observa- 
tion. More over, they amount to building an envelope detector in either the original space of the time series or a 
transformed space, as in the case of the SVD. We will next analyze a data set which is generated by a dynamical 
system which was a fault using dynamic models. 

DYNAMIC MODELS 

Of course one could choose to model ail manner of dependencies between bits by building a Fall blown 
Bayesian network with both the structure of the dependency graph and the conditional probability tables are 
learned. However, since we must keep things computationally inexpensive we limit ourselves to tree mod- 
els which can be learned in polynomial time. A comprehensive account of tree models can be found in Ma- 
rina Meila's thesis available at http://people.csail.mit.edu/people/mmp/thesis.html. The sim- 
plest tree model could utilise the Chow-Liu algorithm [?]. That algorithm is straightforward and efficient. The 
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model has the form P(x) = Yin=i ^(^(^)l^(Pa(n))) where Pa(n) is the parent of variable x(n)? The optimal 
parents can be determined by looking at the pairwise mutual informations with the details supplied in [?]. Miela 
gives improvements in the case where we have a very large number of columns of X (sensors). 


0.1 Non-Stationary Aspects 

In our approach regardless of how P(x) is modelled time dependence is addressed by building a model for 
P(X — P(xi,X2, • • ■ ) for all rows (times). One reasonable modelling possibility here relies upon the Markov 
assumption where we assume that x t is conditionally independent of all previous rows x t ' (with £' < t) except 
for Xt-i- In this case this simplification would give P(X) = P(x x ) JJ t>1 P(x t |x t _i). A naive hidden Markov 
model (HMM) is inappropriate to learn this model as any given row vector can assume 2 N possible values. Thus, 
we would first have reduce the number of features (perhaps by clustering with a mixture model) describing P(x) 
and then build an HMM over these features. 

Instead of pursuing this avenue we start with something simpler. We build a cruder description of the time 
dependence using a change point model. We assume that the model is stationary within different phases of the 
flight (perhaps takeoff, cruise, and landing), but that the behaviour between phases can differ. In the present 
context the change point model works as follows. For simplicity I assume 3 different phases T, C, and L but 
the idea is generalizable to any number of phases. In this case we also have very good ideas when the phases 
transition. 

Assuming £2 > £ 1 divide the data so that tt = 1 : £1 is the takeoff phase T, re — £1 + 1 : £2 is the cruise phase 
C, and tl = £2 + 1 : T is the landing phase. 4 Define X^, X^ 7 , X^ as X(tt , :), X(rc , :), and X(tl, :) respectively. 
Similarly D T = {Xf , X2 , • * • }, D c = {Xf , X^, • • • }, and D L — {Xf, X2 , • • * } are the corresponding data sets 
that can be used to train the models. Using these data sets we train a model exactly as using techniques described 
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P(X) = P T {X T )P C (X C )P L {X L ). 


What remains then is how best to determine the phases, i.e. how to best determine the phase boundaries t\ and 
^2* We do this in the standard way by maximizing the log likelihood with respect to these unknown parameters. 
The data sets D T , D c , and Dl depend on t\ and t<i and so different choices will gives different values for the 
empirical log likelihood 

J B(t 1 ,t 2 ) = ^logP(X / ) = ^{logP T (X T )+logP c (X c ) + logP i (X i )}. 

/ / 

Thus, we just optimize this quantity to determine the best boundaries. Naively, the optimization can be accom- 
plished by trying a set of possible boundaries and selecting the best from amongst this set. Less naively, we 
might hillclimb to a local peak in log likelihood under a neighbourhood where the neighbours of (£1,^2) ^0 
(ti + 1,^2)/ (£1 — M2), (£i ,£2 + 1 ), (£i,£ 2 — l)- 5 In this particular case where the phases correspond to take- 
off, cruise, and landing we have a good idea where the appropriate boundaries might be and we can focus the 
optimization in this area. 
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